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Abstract 

High-dimensional correlated data pose challenges in model selec- 
tion and predictive learning. The predictors can be naturally grouped 
in some applications where pursuing the between-group sparsity is 
preferred. Moreover, the problems of interest often go beyond Gaus- 
sian models. This paper provides a framework to tackle these chal- 
lenges. We derive an iterative thresholding technique which solves 
the penalized generalized linear model problem in general. We es- 
tablish rigorous convergence conditions that are much more relaxed 
than those given in the literature and practically result in a decrease 
of the number of iterations. Our theories allow for nonconvex penal- 
ties (including the Zo-penalty) and arbitrarily grouped predictors. A 
nonconvex hard-ridge penalty is advocated for joint model selection 
and prediction. A novel selective cross-validation (SCV) scheme is 
proposed for parameter tuning. We study super-resolution spectrum 
estimation and analyze real microarray data to illustrate the proposed 
methodology. 



1 Introduction 

Recent research on sparsity problems pays special attention to penalized log- 
likelihood estimation. A typical approach is the lasso (Tibshirani 1996). 
The associated /i-penalized least-squares is a convex problem and can be 
efficiently solved (Osborne et al. 2000; Efron et al. 2004; Daubechies et al. 
2004; Friedman et al. 2007). Theoretical investigations on the lasso selection 



and estimation include Zhao & Yu (2006), Donoho et al. (2006), Bunea et al. 
(2007), Zhang & Huang (2008) and the references therein. There are vari- 
ous extensions and modifications, e.g., the grouped lasso (Yuan & Lin 2006), 
Dantzig selector (Candes & Tao 2005), the adaptive lasso (Zou 2006), and the 
elastic net (Zou & Hastie 2005) among others. On the other hand, although 
the Zx-norm provides the tightest convex relaxation of the Zo- norm ; the ^i pe- 
nalization may not be selection consistent even for large sample sizes (Zhao & 
Yu 2006; Zou 2006) and may introduce estimation bias (Meinshausen 2007). 
More importantly, the lasso cannot handle collinearity (Zou & Hastie 2005). 
There is much room left for improvement in sense of both accuracy and 
parsimony for correlated large-p data such as gene expression microarrays. 

1.1 Nonconvex penalized estimators 

Recently, nonconvex approaches have been developed to meet the challenge 
- see, e.g., Antoniadis & Fan (2001), Zhang et al. (2006), Cai et al. (2007) 
among others. The popular nonconvex penalties include the /^-penalty (or 
Bridge-penalty, Frank & Friedman (1993)) with < p < 1 and the SCAD- 
penalty (Fan & Li 2001). One key challenge facing nonconvex penalized 
log-likelihoods lies in optimization, especially for nonorthogonal designs and 
nonGaussian models. Recently, Zhang (2009) proposed the MCP for least- 
squares models; for generalized linear models, the LQA (Fan & Li 2001)/per- 
turbed LQA (Hunter & Li 2005) /LLA (Zou & Li 2008) can be employed. In 
particular, the LLA introduces inherent sparsity (exact zero components) in 
the coefficient estimate by repeatedly solving an adaptive lasso problem with 
the weights updated at each iteration. 

1.2 Thresholding 

Many thresholding rules originated from wavelets denoising where the re- 
gression matrix is orthonormal. The famous examples are soft-thresholding, 
hard-thresholding, firm- shrinkage, and SCAD. Interestingly, all of these thresh- 
oldings can be obtained via minimizing a univariate penalized least-squares 
model (Antoniadis & Fan 2001). Antoniadis (2007) successfully generalized 
the result to any thresholding rule. See Section 2 for a rigorous definition of 
the threshold function. 

A breakthrough beyond orthonormal models is from the lasso computa- 
tion. Daubechies et al. (2004) showed an iterative soft-thresholded procedure 
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solves the l\ penalized least-squares. Later Friedman et al. (2007) discovered 
a coordinate descent algorithm which can be viewed as a variant of the pre- 
vious procedure. Recently, She (2009) extended the technique to nonconvex 
penalties, and studied a wide class of thresholding-based iterative selection 
procedures (TISPs) for an arbitrary thresholding operator. In computation, 
TISP is simpler than LLA because at each iteration step a low-complexity 
thresholding is performed rather than a lasso optimization. However, this 
work only applies to Gaussian models. 

The main contributions of the paper are as follows. 

• This paper provides a general framework for penalized log-likelihood 
optimization for any generalized linear model (GLM). 

• We perform rigorous theoretical convergence analysis, and the obtained 
conditions are more relaxed than those given in the literature and lead 
to a decrease of the number of iterations in applications. 

• Our theories allow for nonconvex penalties such as the Iq, bridge and 
SCAD penalties. 

• The predictors can be arbitrarily grouped, to pursue the between-group 
sparsity. 

• We perform extensive simulations to investigate the penalty design. A 
nonconvex hard-ridge penalty is advocated. 

• A novel selective cross-validation (SCV) is proposed to tune the regu- 
larization parameter. 

Compared with relevant works, our contributions are novel. Friedman 
et al. (2007) and She (2009) just considered convex/nonconvex penalized 
least-squares problems, and the predictors are not allowed to be grouped. 
Yuan & Lin (2006) first proposed the group l\ penalized least-squares prob- 
lem and developed an algorithm, but the predictors within each group must 
be orthogonal to each other. Friedman et al. (2010&) approximated the pe- 
nalized GLM problem by penalized weighted least-squares and extended the 
coordinate descent algorithm. This approximation might not provide a solu- 
tion to the original problem, and has no guarantee of convergence. We also 
noticed the recent work by Friedman et al. (2010 a) after the completion of 
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our paper. Some group penalties that are convex were studied within the 
Gaussian framework. There was seemingly no convergence analysis. 

The rest of the paper is organized as follows. Section 2 introduces neces- 
sary notation. Section 3 presents the main theorem on B-estimators. Con- 
crete examples are given in Section 4, as well as some implementation de- 
tails and variants including the relaxation form and the preliminary feature 
screening. Section 5 investigates different choices of the penalty function 
by simulation studies. Section 6 proposes a selective cross-validation (SCV) 
scheme for parameter tuning. In Section 7, super-resolution spectrum re- 
construction is studied and a real microarray data example is analyzed to 
illustrate the proposed methodology. 

2 Notation and Definitions 

This paper assumes a generalized linear model (GLM) setup that goes be- 
yond Gaussianity. Assume the observations j/i, • • ■ ,y n are independent and yi 
follows a distribution in the natural exponential family f(yf, 9i) = exp(?/j^ — 
b{6i) + c(yi)), where 0j is the natural parameter. Let Lj = log f(yi, 9i), 
L = Li- Clearly, Lj = — b(9i) + c(yj). It is well known that ^ = 
E(Vi) = b'(9i), var(yi) = b"(9i). Let X = [xi,x 2 , ••• } x n ] T = [sc 1} • • • ,x p ] 
be the model matrix. The canonical link function, denoted by g, is applied 
such that xf/3 = g(fii) = 9 i: and thus g = (b')' 1 . For instance, when ~ 

Bernoulli (n*), /(y»; 6 t ) = exp jy; log + log(l - vr^j = exp {y^ - log(l + e e *)}, 
for which 0$ = log ^z^, A*i = tt^, 6(t) = log(l + e'), and g>(t) = log ^ (the 
logit link). In the Poisson case where yi ~ Poi(a;i), f(yi',9i) = ^e~ Ui co^ = 
exp(y; logUi -Ui- logyj) = expfaOi - e dl + c(^)) with 9i = logo;*, ^ = Wj, 
= e*, and g(t) = logt (the log link). Moreover, our studies can be 
trivially generalized to the exponential dispersion family (see Section 4.2) 
which covers the Gaussian model. The Fisher information matrix at (3 is 
I((3) = X T WX with W = diag {b"(xJ/3)}. Next we give a rigorous defi- 
nition of the thresholding rule to be used as the main tool. 

Definition 2.1 (Threshold function). A threshold function is a real valued 
function G(£; A) defined for — oo < t < oo and < A < oo such that 

1. 9(-t;A) = -6(t;A), 

2. Q(t; A) < e(t'; A) /or* < 
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3. lim^oo 0(t; A) = oo, and 

4. < 0(t; A) < t /or < t < 00. 

In words, ©(•; A) is an odd monotone unbounded shrinkage rule for t, at 
any A. 6(0; A) = by definition. A vector version of 6 (still denoted by 0) 
is defined componentwise if either t or A are replaced by vectors. When both 
t and A are vectors, we assume they have the same dimension. 

Definition 2.2 (Multivariate threshold function). Given any threshold func- 
tion ©(•; A) with A as the parameter, we define its multivariate version by 

0(a;A) = a°0(||a|| 2 ;A), Vael" (2.1) 

where 

f^b ifa^O 

a ° — J ||«|[2' J ' 

\0, i/a = 0. 

Obviously, ||6(a; A)|| 2 = 0(||a|| 2 ; A) < ||a|| 2 . 

Thresholding rules are associated with penalty functions. There may 
exist multiple (or even infinitely many) penalties that correspond to the same 
threshold function. The following three-step construction finds the penalty 
with the smallest curvature (Antoniadis 2007; She 2009): 

e^M; A) = sup{t : 0(t; A) < u}, 
s(u; A) = _1 (u; A) - u, and 

r\e\ [ ] 

P(6; A) = P e (9; A) = / s(u] A) du, 



where u > holds throughout (2.2). The constructed penalty P® is nonneg- 
ative and is continuous in 9. 

Finally, by definition, it is easy to see _1 must be monotonically increas- 
ing and its derivative is defined almost everywhere. Let L® be a constant 
associated with such that 

s > —Lq a.e. 

Seen from (2.2), s' > —1 almost everywhere and so a finite Lq exists. 
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3 Main Theorem of 0-estimators 



To present a general result, we assume the predictors are grouped in this sec- 
tion. That is, the design matrix is grouped into K blocks: X = [Xi, • • • , X K ] e 
E nxp . The predictor groups never overlap but the group sizes can be differ- 
ent. When there are p groups, each being a singleton, the model reduces 
to the common 'non-grouped' GLM. We would like to study the grouped 
Pfc-penalized log-likelihood given by 

K 

F((3) 4 -L(/3) + X>*(ll/3j 2 ;A fc ). (3.1) 

k=l 

where L = J2i=i -^i (see Section 2) is the log-likelihood of the data, f3 k are the 
coefficients associated with X^, and P k are the penalty functions which allow 
to be discrete, nonconvex, and nondifferentiable at zero (like the l^-penalty) , 
to enforce sparsity on (3. A parsimonious model is usually preferred in model 
interpretation especially when the dimensionality p is much greater than the 
sample size n. 

Somehow interestingly, it is more convenient to tackle (3.1) from a thresh- 
olding viewpoint. We define the group 0-estimator induced by (9i, • • • , ®k) 
to satisfy the following nonlinear equation 

(3 k = Q k ((3 k + X T k y - X T k n(P); h), 1 < k < K, (3.2) 

where = E(yi) = g~ 1 (xj[3). Our main theorem below establishes a uni- 
versal connection of the group 0-estimators to the penalized estimators. 

We introduce the GLM version of the group TISP which generalizes She 
(2009): at each iteration step j, the new /3^ +1 > is updated through the 
multivariate thresholding 

(3 k ' +1) = & k ((3 k j) + X T k y - X T k n((3^)- \ k ), \<k<K. (3.3) 

We expect (3.3) to converge properly to the group 9-estimate under some ap- 
propriate conditions, which in turn solves the penalized log-likelihood prob- 
lem (3.1) in general. 

Theorem 3.1. For arbitrarily given thresholding rules 0& A — k < K) and 
g fljp ; suppose fyi) are the group TISP iterates (j = 1, 2, • • • ) induced by 
(0i, 02, • • • , 0rt)- F° r eac h ®k, -Pfc(') be any function satisfying 

P k (9; A fc ) - P k (0; A fc ) = PeM W + q k (0; A fc ), 
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where P@ k is obtained via (2.2), qk{-', A&) is nonnegative and c?fc(Ojfc(0; A&)) — 
for allO. Denote by A the set of{t/3 (j) + (l-t)/3 {j+1) :te(0, 1), j = 1, 2, - • • }, 
and define p = sup^ eA ||X(£)|| 2 . Define Lq 17 ... i q k = maxi<k<K Le k . Then, if 
p < max(l,2 - Le 1 ,...,e jr ), 

F(/3 (J) )-F(/3 (J ' +1) )>C||/3 W -/3 (J+1) ||2>0, j = l,2,--- (3.4) 

where C = max(l,2 — Le 1] ... j e if ) — p. //, further, p < max(l,2 — Le X) ... ,e A -)> 
t/jen an?/ /imit point o/ i/ie group TISP sequence /3^' must be a fixed point of 
(3.3), or a group Q-estimate. 

The theorem allows for p > n and applies to any threshold functions, 
even if they are not nonexpansive. To prove Theorem 3.1, we introduce the 
following lemmas. 

Lemma 3.1. Given an arbitrary thresholding rule G, let P be any function 
satisfying P(9] A) — P(0; A) = Pe(9; A) + q(9; A) where g(-; A) is nonnegative 
and q(Q(9; A)) = for all 9. Then, the minimization problem 

mmJ|| 2/ -/3||^+P(||/3|| 2 ;A)^Q(/3;A) 

has a unique optimal solution given by (3 = 0(y; A) for every y provided that 
0(-;A) is continuous at ||y|| 2 . 

Note that P (and P@) may not be differentiable at and may be noncon- 
vex. 

Proof of Lemma 3.1. First, it suffices to consider (3 satisfying > since 
Q(fl) > Q{—fl)- By definition, we have 

Q((3)-Q0) = - 2/ r (/3-^) + i(||/3||^-P||^) + Pe(||/3|| 2 ;A)-Pe(||/3|| 2 ;A) 

+ff(||i9|| 2 ;A)-g(e(||y|| a ;A)) 
rWPh 

= -y T (f3 (u + Q-\u; A) - u) du + q(\\(3\\ 2] A). 

Je>(\\yb-A) 

On the other hand, 

-y T ((3-P) = -y T /3+||y|| a e(||y|| 2 ;A) 

> H|y|| 2 ||/3|| 2 +||y|| 2 e(||y|| 2 ;A) 
= -||y||a(||/3|| 2 -e(||y|| 2 ;A) 

= -||y||2(||/3|| 2 -p|| 2 ). 
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Hence Q(f3) - Q0) > /^ll*^ 9 "^ A ) - Wvh) d« + l(\\P\k A). 

Suppose ||/3|| 2 > 6(||j/|| 2 ;A). By definition _1 (||/3|| 2 ; A) > ||y|| 2 , and 
thus Q((3) > Q{p). Furthermore, there must exist some u G [0(||t/|| 2 ; A), ||/3|| 2 ) 
s.t. _1 (w;A) > H2/H2, and hence Q(/3) > Q(/3) due to the monotonicity 
of _1 . In fact, if this were not true, we would have 0(t; A) > \\f3\\ 2 > 
0(||t/|| 2 ;A) for any t > 1 1 2/ 1 1 2 , and ©(■; A) would be discontinuous at t. A 
similar reasoning applies to the case when \\/3\\ 2 < 0( Ill/lb; A). The proof is 
now complete. □ 

Hereinafter, we always assume 0(t; A) is continuous at any t to be thresh- 
olded, since a practical thresholding rule usually has at most finitely many 
discontinuity points and such discontinuities rarely occur in real data. 



Lemma 3.2. For any a^O, \\a + b\\ 2 — [| cz.|| 2 > b T a°. 

Proof of Lemma 3.2. By definition, a° = o/||a||2 and thus b T a° + 1 1 cc (| 2 = 
a T (a + b)/||a|| 2 . The lemma follows from Cauchy's inequality. □ 



Lemma 3.3. LetQ (f3) = \\y— /3||2/2+Pe(||/3|| 2 ; A). Denote by /3 the unique 
minimizer of Qo(f3). Then for any S, Qo(0 + S) — Qo(0) > Ci||<5||!/2> where 
Ci = max(0, 1 — Lq). 

Proof of Lemma 3. 3. First we have 

Q o + S)-Q o 0) = ^0 + S-y\\l-^0-y\\l + P e (0 + S\\ 2 )-Pe(0h) 

= Ml + (P-y) T S+ s(u;X)du (3.5) 

1 J\\Ph 



(i) If fa = 0, 6(y; A) = and so 6(||y|| 2 ; A) = 0, from which it follows that 
Wvh < © _1 (0;A). Therefore, 0-y) T S > -\\y\\ 2 ■ \\S\\ 2 > -0 _1 (O; A)||<5|| 2 = 
— s (\\fl\Wi A) du. (ii) If f3 7^ 0, it is easy to verify by Lemma 3.1 that 

/3 satisfies y - p = s(\\(3\\ 2 ; X)y° = s(\\(3\\ 2 ; A)/3°, and thus - y) T 5 = 
— s(\\/3\\ 2 ; \)6 T f3 . From Lemma 3.2, we obtain 

- y) T S >-(\\6 + (3\\ 2 - ||/3|| 2 ) a (||/3|| 2 ; A) = - / s(\\(3\k A) du. 
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In either case, (3.5) can be bounded in the following way: 

Q o + 6)-Q o 0) > 1 -\\6\\l+ [ m+Sh (s(u; X) - s(0\\ 2] X)) du 

1 JwbWo. 



2 



i rWP+sW 

( (e- 1 ^; A) - e-\\\P\\ 2 ; A)) - (u - 0\\ 2 ) ) du. 

J0h 



By the Lebesgue Differentiation Theorem, (G exists almost every- 
where and 

WP+Sh rWP+Sh ru 

(Q- 1 (u ] X)-Q- 1 (\\P\\ 2] X))du> / / (0- 1 )'(i;;A)d^d M . 

0h J\\$h J\\Ph 

Lemma 3.3 is now proved by the definition of L®. □ 
Now we are ready to prove the theorem. As an alternative to F(/3) given 
by (3.1), define 



G(/3, 7 ) = -^L,,( 7 ) + ^P fe (||/3 fc || 2 ;A fc ) + ^|7-/3||^ 

i=l k=l 
n n 

- Yfl{x^) - b(xj(3)) + A*i03)(*f7 " (3-6) 



i=l i=l 



Given f3, algebraic manipulations (details omitted) show that minimizing G 
over 7 is equivalent to 

1 K 
min - || 7 - [(3 + - XV(/3)] || 2 + ]T P fc (||/3 fc || 2 ; A fc ). (3.7) 

7 fe=i 

By Lemma 3.1, the unique optimal solution can be obtained through multi- 
variate thresholding 

lk = Q k ((3 k + X T k y - Xln(P); A fc ), 1 < k < K 

even though P k may be nonconvex. This indicates the group TISP iterates 
(3.3) can be characterized by /3 < " ?+1 - ) = argmin 7 G(J3^' ,"f). Furthermore, we 
obtain for any 6 € MP 

G{( 3(J) j(3 m) + S) - G^W.^+D) > C i 5 T 5 +J2 qk (\\^ + tffc || a; Afc )(3.8) 

k 
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where C[ = max(0, 1 — L eii ... ,e A ,), by applying Lemma 3.3, and noting that 

ft(ll/3? +1) ||2) = 0. 

By Taylor series expansion, 



i=i i=i 
for some = tf3 U) + (1 - t)j3 {j+1) with t G (0, 1). Therefore, 



F(/3 (J+1) ) + ^(/3 (i+1) - (3 {j) f{I - X(^))((3 U+1) - (3 ijy ) 
F(P U) ) - - /3 (i) ) T (/3 (i+1) - /3 (J) ), 



or 



F(/3 (i) ) - F((3 U+1) ) > ^((3 u+l) - (3 {J) ) T (C[I + / -Z(£ (i) )) (/3 (j+1) - /3 (j) ) 

which proves (3.4). 

Now assume a subsequence — )■ /3* as / — >• oo. Under the condition 
p < max(l, 2 — Le^ .. ,e K ), C > and 

\\pin+i) _ pWg < (i?(0C*)) _ F(/3 (ji+1) ))/C < (F(/3 (ji) ) - F(/3 (j!+l) ))/C ->■ 0. 

That is, e fc (/3^' !) + Xjfo - XZn(J3W)',\ k ) - O. Therefore, /3* is a 

group 0-estimate. 

4 TISP for Generalized Linear Models 
4.1 Intuition and Examples 

Interestingly, the p-condition in Theorem 3.1 is imposed on the global design 
and has nothing to do with the grouping manner. This indicates no matter 
how the predictors are grouped, for an arbitrarily given model matrix, a 
simple preliminary scaling X X/ko (with a little abuse of notation) always 
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guarantees the convergence of the group TISP, provided k is large enough. 
Our main theorem helps to find a general ko in any specific GLM regardless 
of 9, A, and K. 

Example 4.1 (Gaussian GLM). If yi are Gaussian, fJ,({3) = Xf3 in (3.3). 
Because I = S = X T X, ko > \\X\\2 suffices regardless of the specific 
thresholding rules. If, further, the predictors are not grouped (K = p) and 
all Qk's are identical, (3.3) reduces to She (2009). 

Example 4.2 (Binomial GLM). If yi ~ Bernoulli^ i) as in the classification 
problem, fi(f3) = l/(l + exp(— X0)) (all operations being elementwise except 
for the matrix-vector multiplication Xf3). Now (3.3) reduces to 



= ( & + X*v - Xl 



_l + exp(-X/3 



0> 



; At 



(4.1) 



J nxp 



Further, for nongrouped predictors (K = p) and identical 's, we can use 



(i+i) 



I ^ + X T y - X 1 



1 + exp(-X/3 



(4.2) 



nxp 



In any case, since Wi = b"(xjf3) = 7r»(l — 7r») < 1/4, a somewhat crude but 
general choice is k$ > \\X\\2/2. Note that (4.2) is different than the algorithm 
in Friedman et al. (2010b) which approximates the original penalized logistic 
regression problem by a penalized weighted least-squares problem and is not 
guaranteed to converge. 

On the other hand, for all legal values of ko, our experience clearly indi- 
cates smaller k leads to faster convergence. Therefore, to reduce the com- 
putational complexity, finding the least possible k is highly desirable in 
concrete applications. The bound in Theorem 3.1 provides useful guidance 
in this regard, and seems to bee strong enough (empirically) to make an ideal 
choice for various O. 

There are rich examples of O to be used. See Figure 1 for an illustration. 
The function q in the theorem is often 0, but we use nontrivial q in Example 
4.5 and Example 4.7 to derive multiple penalties (including the discrete Iq- 
penalty) all resulting in the same O-estimator. 



Example 4.3 (Soft). When O is the soft-thresholding, L® 
A ||/3 ||i. The scaling constant can be relaxed to ko = \\X\\- 



and P((3; A) = 
2 in regression 
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Figure 1: Some examples of the thresholding rules and their corresponding penal- 
ties. Left to right: Soft, Ridge, Hard, SCAD, and Hybrid. 



and fco = ||^||2/(2v / 2) in classification. In the group case, suppose all Qk 
are soft-thresholdings, then the group TISP (3.3) solves (3.1) with penalty 
Afc||/3 fc ||2, the scaling constant being the same. This extends the grouped 
lasso (Yuan & Lin 2006) to any GLM. Furthermore, we do not have to make 
the simplistic assumption that the predictors must be orthogonal to each other 
within each group assumed by Yuan & Lin. 

Example 4.4 (Ridge). Let G be the ridge thresholding Q(t; A) = Then 
it solves the ridge regression problem with P(/3; A) = A||/3|||/2. Though in- 
ducing no sparsity, the ^-penalty often leads to better accuracy in estimation 
and prediction. 

Example 4.5 (Hard). Let be the hard-thresholding. The three-step con- 
struction yields 

PH{e , x) = \-fi^m, tfl«l<A (43) 

1 A 2 /2, if \6\ > A V ' 



and Lq = 1. Interestingly, setting 



2 



q(9; A) 



if < \9\ < X 



2 

0, if = or \0\ > A. 
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we obtain the lo-penalty P(9;X) = A 2 /2 • l^o- ^ n f ac -t, our theorem im- 
plies there exist infinitely many penalties all mimicking the lo-penalty and 
resulting in the same Q-estimate. For example, we can similarly justify 
P{9; A) = aP H (9; X/ y/a) for any a > 1 which is continuous and monotone on 
[0, oo). For grouped predictors, using hard-thresholdings for 9^ (1 < k < K) 
can attain more between-group sparsity, the corresponding penalty given by 

Example 4.6 (SCAD & Firm). Suppose is the SCAD -thresholding. Then 
we get the SCAD-penalty whose derivative is defined by 

ifd<\ 

P'(0;A) = { (a\-0)/(a- 1), if X < 9 < aX (4.4) 
, if 9 > aX 

for 9 > and a > 2. Lq can be l/(a — 1). By default, a = 3.7 (Fan 
& Li 2001). A closely related thresholding is the firm shrinkage (Gao & 
Bruce 1997) 

{0, if \t\ < aX 

*- a y> , z/«A<|t|<A (4.5) 
t, if \t\ > A, 

where < a < 1. The penalty function is aPn{t; X), where Ph is given by 
(4.3). An equivalent form of this penalty is used in MCP (Zhang 2009). 

Example 4.7 (Hard-ridge). The hybrid hard-ridge-thresholding (She 2009) 
is defined based on hard-thresholding and ridge-thresholding 

e( ( ;A,,) = (°; ™^ (4.6) 
Its penalty function fuses the hard-penalty and the ridge-penalty 

r M = H°:^> ™<£ (4.7) 

A nice fact is that by setting q(x;X,rj) = -ip-(\x\ — A) 2 1 <| x |<a we get 

= l r > 02 + \t^ u *>- < 48 > 
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Statistical modeling usually has joint concerns of accuracy and sparsity in- 
terplaying with each other during the fitting. (4.8) offers both selection and 
shrinkage into regularization. In the group situation, the multivariate ver- 
sion of (4.6) aims for a penalty of form ^=1 Ml*k\\¥o + £f=i Y^kWi 
which is able to deal with the collinearity in the design in the pursuit of 
between-group sparsity. 

4.2 TISP extensions 

The group TISP (3.3) is a neat procedure involving no high-complexity op- 
erations like matrix inversion, thus advantageous for high-dimensional com- 
putation. We introduce some extensions to the vanilla TISP. 

TISP with shift. A shift vector a may appear in the model but not 
penalized in (3.1). This occurs when an intercept term is included in the 
penalized GLM. Although for a Gaussian model one can center both X 
and y to make the intercept vanish, in general, centering the response may 
violate the distribution assumption for nonGaussian GLMs. The alternative 
optimization can be used: given f3, we can run Newton's algorithm to solve 
for a, and given a., (3 is still updated according to (3.3) but with the mean 
vector fi(/3,ac) = [p _1 (a;f/3 + CKj)] n xi- Alternatively, the weighted form of 
TISP can be used which is often more efficient. 

TISP with weights. The in (3.3) are not necessarily equal to each other. 
A component-specific A offers relative weights in regularizing the coefficients. 
The weighted form can also handle GLMs with dispersion: f(y i ;6i,4>) = 
exp[(?/j6>j — b(9i)) / (Ai<f)) + c(|/j, </>)], where is the dispersion parameter and 
Ai is a known prior weight. Note that <fi will not appear in the target function 
F{(3) (3.1) because it is a nuisance parameter orthogonal to 0j. The normal 
and binomial GLMs are concrete examples. 

TISP with relaxation, TISP with asynchronous updating. Although (3.3) 
is a nonlinear process, relaxation and asynchronous updating can be incor- 
porated to accelerate the convergence. The asynchronous updating of (3.3) 
leads to in-place computation of {3 and the mean /j, is always calculated using 
the recently updated f3. Under the assumptions that yi are Gaussian and the 
penalty is convex, this gives the coordinate descent algorithm in Friedman 
et al. (2007). Yet for nonGaussian GLMs, our experience shows that the 
synchronous form seems to be more efficient. The relaxation of (3.3) is given 
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by 



/3i J+1) = © fc (^ +1) ;A fc ), 1<A;<K. 

We used (4.9) with a; = 2 in implementation. The number of iterations can 
be reduced by about 40% in comparison to the original form. 

4.3 Some implementation details 

The group TISP (3.3), though associated with nonconvex penalties, is easy to 
implement. Our theories guarantee a local optimum given any initial point 
/3 (0) . Although one can try multiple random starts, pursuing the globally 
optimal solution to (3.1) is not at all needed to achieve dramatic performance 
gains, as will be supported by our experiments in Section 5 and Section 7. 
We recommend simply using the zero start, i.e., /3 (0) = in (3.3) or (4.9). It 
finds a 0-estimate closest to zero in some sense in building a parsimonious 
model. Warm starts for a grid of values of A may not be appropriate unless 
the problem is convex. Perhaps surprisingly, in general, the solution path 
associated with a nonconvex penalty is discontinuous in A for nonorthogonal 
designs. This is true for say SCAD, the thresholding of which is continuous 
on (0,+oo), and even the transformed l\ (Geman & Reynolds 1992) whose 
penalty and threshold function are different iable to any order on (0, +oo). 
Pathwise TISP with warm starts does not fully employ the flexibility offered 
by nonconvexity and is easy to trap into poor local optima. 

The search range of A can be determined from (3.3). For instance, as- 
suming all Ofc's are identical with A as the threshold and X has been column 
normalized, we can let A vary over the interval to \\X T y - X T n(0) H^. Fi- 
nally, it is not difficult to prove that after convergence, the hard-TISP yields 
a MLE restricted to the selected dimensions. Similarly, the hybrid TISP gives 
a restricted ridge estimate. Hence we can correct the final estimate after the 
maximum number of iterations allowed has been reached. In general, the 
nonzero coefficient estimates can be further calibrated by solving a smooth 
penalized problem (say using Newton's method) restricted to the selected 
dimensions. 
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4.4 Proportional TISP Screening 

TISP can be applied directly to large dimensions, but this may be ineffi- 
cient and unnecessary due to the existence of so many nuisance variables 
in sparsity problems. Recently, ultrahigh dimensional learning is proposed 
where current optimization packages are computationally too expensive if 
directly applied. More seriously, the huge number of features will produce 
spurious correlations and make it difficult to select the right model or es- 
timate the coefficients stably. Feature screening thus must come into play 
beforehand. Independence screenings SIS and ISIS (Fan & Lv 2008; Fan 
et al. 2009) are advocated to be performed as a fast but crude way to reduce 
the dimension to a moderate size an (say a = 0.75); in the second stage a 
more sophisticated technique such as the /i-penalty or SCAD-penalty deter- 
mines the ultimate significant predictors. Independence screenings are based 
on marginal statistics, and thus the feature ranking is meaningful when the 
features are marginally unrelated. 

We recommend feature screening by running proportional TISP: at each 
iteration step, the threshold value in (3.3) is chosen such that precisely an 
nonzero components arise in (3 (j+l) . After convergence, we obtain an candi- 
date predictors. Assuming the model is Gaussian, SIS or FDR corresponds 
to the first step if /3^°- ) = 0. Our proportional TISP screening is nonmarginal 
seen from its iterative nature and is less greedy than independence screenings. 

5 Penalty Design 

The design of the penalty P or the threshold function is the first im- 
portant issue to apply penalized methods into real-world problems. The 
SCAD seems to be popular in the literature. The firm shrinkage (4.5) is 
also used (Zhang 2009). However, the unbiasedness for large coefficients 
might not contribute to statistical modeling especially in large-p problems. 
With shrinkage introduced, we could benefit from the bias-variance tradeoff 
in predictive learning. In fact, if we had learned the truly relevant covari- 
ates, imposing a ridge-penalty would often be appropriate. The joint and 
adaptive consideration of selection and shrinkage is necessary even if we just 
focus on variable selection, because the regularization parameter is often 
tuned according to the prediction performance. Hence the hybrid hard-ridge 
thresholding seems to be an ideal choice. Its penalty (4.8) fuses the Z - n orm 
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and / 2 -norm (squared). In addition to the / -portion which enforces sparsity, 
the /2-portion handles collinearity and adapts to different noise levels. 

We carried out extensive simulations to compare different penalizations. 
Five competitors were studied: Lasso with calibration, one-step SCAD (with 
modification), and the non convex / , SCAD, and hard- ridge penalties. The 
first two are convex but multi-stage methods. For each A, we calibrated the 
lasso estimate by fitting a unpenalized likelihood model restricted to its se- 
lected predictors. This is similar to the idea of the 'LARS-OLS hybrid' (Efron 
et al. 2004). One-step SCAD is an example of the one-step LLA which fits a 
weighted lasso with the weights constructed from MLE by use of the penalty 
function. We used the previous tuned Lasso-MLE as the initial estimate for 
weight construction. It behaves better than the MLE and can be applied to 
p > n. The remaining three methods are nonconvex and can all be computed 
by TISP. Although SCAD shrinks moderate coefficients, neither of the first 
two penalties introduces estimation bias for large coefficients. Hard-ridge 
penalty does simultaneous selection and shrinkage controlled by a threshold- 
ing parameter A and a ridge parameter 77. For efficiency, we did not run a full 
two-dimensional grid search in the experiments when looking for the best pa- 
rameters. Instead, for each 77 in the grid {0.57/*, 0.05?]*, 0.00577*} where 77* is 
the optimal ridge parameter, we found A (77) to minimize the validation error. 
Then for A fixed at the optimal value, we found the best 77 to minimize the 
validation error. This simple empirical search only compares four solution 
paths and is more efficient. 

Recall that we seek to specifically evaluate and compare the performances 
of different penalties in this section. To make the comparisons fair and less 
affected by various parameter tuning plans, we generated a large validation 
dataset for tuning purposes. Concretely, we set f3 = (b, 0, b, b, 0, • • • , 0) T , 
X = [xi,x 2 ,--- ,x n ] T and Xi are i.i.d. ~ MVN(0, S) where = p^~ k \ 
1 < j,k < p. We varied the control parameters in the model: (n,p) = 
(100, 20), (100, 100), (100, 500), p = 0.1, 0.5, 0.9, and b = .75, 1, 2.5. We gen- 
erated a large test dataset with 10K observations to evaluate the performance 
of any algorithm, as well as a validation dataset of the same size to tune the 
regularization parameters. All 3 3 = 27 combinations of the problem size, cor- 
relation, and signal strength were covered in the simulations. We measured 
an algorithm's performance by prediction accuracy and sparsity recovery, for 
each model simulated 50 times. The test error is characterized by scaled 
deviance error (SDE) given by 100 • (J2f=i log fa)/ Eii log f(y%\ (3) - 1) 
defined for the test data. For stability, we reported the 40% trimmed-mean of 
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the SDEs (instead of the median) from the 50 runs. The variable selection re- 
sults are reported via three benchmark measures: the mean masking (M) and 
swamping (S) probabilities, and the rate of successful joint detection (JD). 
The masking probability is the fraction of undetected relevant variables, the 
swamping probability is the fraction of spuriously identified variables (false 
alarms), and the JD is the fraction of simulations with zero miss. Note that 
in variable selection masking is much more serious than swamping, and an 
ideal method should have M rj 0%, S « 0%, and JD « 100%. The results of 
the logistic models have been summarized and presented in Figure 2, 3, and 
4. 

First, the Lasso-MLE chose A according to the bias corrected lasso and 
alleviated the issue that even when the signal-to-noise ratio is pretty high, 
the lasso overselects (Leng et al. 2006). The nonconvex hard-TISP (Iq) yields 
a restricted MLE, too, but is single-stage, and often does better in variable 
selection. The weighting technique in one-step SCAD, though effective for 
p fixed and n — > oo, requires a careful choice of the initial estimate in finite 
samples. We also found the improvement brought by weighting was some- 
what limited, especially when some predictors are correlated. Fully solving 
the SCAD problem, although using a naive zero start, showed good large-p 
performance. The above methods or penalties are popular in the literature; 
overall we can not draw a uniform conclusion of which is the best. On the 
other hand, they are all dominated by the hard-ridge penalty. It has striking 
advantage in both prediction and sparsity recovery, in various challenging 
situations of large p, low signal strength, and/or high collinearity in our 27 
experiments. 

6 Parameter Tuning Strategy 

Parameter tuning plays a crucial role in penalized log-likelihood estimation. 
If we assume f3 is sparse and n 3> p nz , i.e., the ratio of the sample size to the 
true dimensionality is large, then BIC works well but may still suffer from 
overselection. Recently, Chen & Chen (2008) proposed large-p variants of 
BIC. Directly cross-validating the regularization parameter A is also popular 
in the literature. However, it may be inappropriate for nonconvex penalties 
due to the following reasons, (i) The optimal regularization parameter in 
(3.1) is a function of both the data and the true f3. As (X,y) change, the 
optimal parameter may not remain the same, (ii) Even if the nonconvex 
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Figure 2: Performance comparison of different penalties in terms of test error, 
masking/swamping probabilities, and joint identification rate for logistic regression 
models with b = 0.75. The sample size n = 100. 
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Figure 3: Performance comparison of different penalties in terms of test error, 
masking/swamping probabilities, and joint identification rate for logistic regression 
models with 6 = 1. The sample size n = 100. 



20 




20 100 500 

Dimension (p) 



r= — 

100 
Dimension ( 



20 100 500 

Dimension (p) 



Figure 4: Performance comparison of different penalties in terms of test error, 
masking/swamping probabilities, and joint identification rate for logistic regression 
models with b = 2.5. The sample size n = 100. 
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penalty and its corresponding threshold function are smooth on (0, oo), the 
solution path /3(A) may be discontinuous in A for nonorthogonal designs. 
Accordingly, although the i^-fold CV is widely used to tune the parameters 
in SCAD and /^-penalties, the K fitted models for the same value of A may 
not be comparable, and averaging these CV errors may be unstable and 
misleading. 

To address this issue in sparsity problems, we propose the following i^-fold 
selective cross-validation (SCV). (1) First, we run a given sparse algorithm 
on the whole dataset for A in a grid of values, getting the solution path /3 l 
and the associated sparsity patterns nzi = nz(f3[), 1 < I < L. (2) Then, for 
each / we run cross-validation to fit K models with just the predictors picked 
by nz\. (3) Finally we determine the best estimate and sparsity pattern from 
the summarized CV errors. Specifically, in Step 2, for the l\ or Iq penalty we 
apply MLE unpenalized and restricted to dimensions nzi on the data without 
the kth subset (1 < k < K). For the hard-ridge penalty, the contribution of 
the ridge parameter r\ must be considered in matching the K local estimates 
to the global estimate. In fact, the degrees of freedom of the /2-penalized 
GLM estimate (3 is given by df(/3, 77) ~ Tr{(X(/3) + 7y/) _1 X(/3)}. For any 
/, in the training step without the kth. subset, we choose 77 such that the 
restricted penalized estimate has the same df as the (3 V This can be done by 
bisection search. In Step 3, we summarize the prediction errors on the left- 
out piece of data by — Y17=i 1°S fiVi'i&i ^) — SCV(Z), where k(i) denotes 

- -k(i) 

the subset index of observation i, and f3 l denotes the estimate without 
the k(i)th. subset, restricted to the dimensions nz\ and having the same df 
as /3j. If the model is very sparse — p nz <C n and p nz <p, a BIC correction 
term can be added: SCV-BIC(Z) = 2 • SCV (I) + logn • df(/3,). Unlike Chen & 
Chen's EBIC, SCV-BIC is not affected by appending zero predictor columns 
to the model matrix. In the SCV procedure, the given sparse algorithm is 
only run once and globally (instead of K times locally) to get the solution 
path and determine the common sparsity patterns. It dramatically reduces 
the computational cost and resolves the model inconsistency issue in the 
plain CV. 
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7 Applications 



7.1 Super-resolution spectral analysis 

We apply group TISP for spectrum estimation in signal processing. Consider 
an interesting TwinSine signal as a typical example arising in applications 
such as source separation and target detection: 

y(t) = ai cos(27r/it + 4>i) + a 2 cos(27r f 2 t + <fi 2 ) + n(t) 

where a x = 2, a 2 = 3, fa = vr/3, 2 = tt/5, h = 0.25Hz, f 2 = 0.252Hz 
and n(t) is white Gaussian noise with variance a 2 . The frequency resolution 
needs to be as fine as 0.002 Hz to perceive and distinguish the two sinusoidal 
components. For convenience, assume the data sequence is evenly sampled at 
n = 100 time points ti — i, 1 < i < n and the sampling frequency f 8 is 1 Hz. 
(However, our approach does not require uniform sampling.) The overcom- 
plete dictionary to attain the desired frequency resolution is constructed by 
setting the maximum frequency / max = f s /2 = 0.5 Hz, and fk = /max • k/K 
for k = 0, 1, • • • , K with the number of frequency bins K = 250. Let 
^cos = [cos(27Ttj/ fc )] 1 <i< n) i< i; <^ and X sin = [sin(27r^/fe)]i<i<„ ! i< fc <x_i. The 
last sine atom vanishes because for integer- valued ti, sin(27rtj/^-) = 0. Then 
X = [X cos X sin ] is of dimension 100-by-499 without the intercept. Note 
that the cosine and sine components with the same frequency are naturally 
grouped. Different noise levels are considered: a 2 = 0.1,1,8. 

The classical Fourier transform based periodogram or least-squares peri- 
odogram (LSP) failed to achieve fine enough resolution by use of only 100 
observations and showed severe power leakage. On the other hand, the basis 
pursuit (BP) (Chen et al. 1998) based on the li penalization can be applied 
thanks to the sparsity assumption. We simulated the signal model at any 
given noise level 20 times to evaluate the performance of a given algorithm. 
At each run, we generated additional test data at iV = 2000 time points 
different than those of the training data to calculate the effective prediction 
error by MSE* = J2?=i(Vi ~ x iP ~ a) 2 /N - a 2 on the test dataset. The 
median of MSE* over 20 runs is reported, denoted by Err, as the goodness 
of fit of the obtained model. The frequency detection is characterized by 
JD, M, and S (see Section 5). Table 1 compares the performance of BP, 
grouped lasso, hard-ridge and grouped hard-ridge penalized regressions on 
the TwinSine signal. With the theoretical support in Section 3, all can be 
easily implemented via (3.3). 
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Table 1: Performance comparison of the basis pursuit (BP), grouped lasso (G- 
Lasso), hard-ridge and grouped hard-ridge (G-Hard-Ridge) penalized regressions 
for spectral estimation. 
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To see the true potential of each penalty, in the first four experiments 
we used additional large validation data of 2000 observations to tune the pa- 
rameters. The penalty comparison showed the dramatic improvement of the 
hybrid hard-ridge penalty over the popular ^-penalty in both time-domain 
prediction and frequency-domain spectrum reconstruction. It also suggests 
pursuing the global minimum of a nonconvex penalized regression might not 
be necessary; the zero start in TISP offers good accuracy and regulariza- 
tion. In the last experiment, only the training data (of 100 observations) are 
used for parameter tuning. We ran SCV with BIC correction. The tuned 
hard-ridge (of group form) achieved excellent performance in super-resolution 
spectral estimation. 

7.2 Classification and gene selection 

We illustrate the proposed methodology with an example of classification and 
gene selection. We analyzed the real acute lymphoblastic leukemia (ALL) 
data conducted with HG-U95Av2 Affymetrix arrays (Chiaretti et al. 2004). 
Following Scholtens & von Heydebreck (2005), we focus on the B-cell sam- 
ples and would like to contrast the patients with the BCR/ABL fusion gene 
resulting from a translocation of the chromosomes 9 and 22, with those who 
are cytogenetically normal (NEG). The preprocessed data can be loaded from 
the Bioconductor data package ALL. This leads to 2,391 probe sets and 79 
samples, 42 labeled with "NEG" and 37 labeled with "BCR/ABL". 

First, to reduce the computational complexity, we ran the proportional 
TISP in Section 4.4 to screen out some irrelevant genes. We used the hard- 
ridge 6. Specifically, for r\ in a small grid of values, we ran proportional 
TISP using the hard-ridge thresholding function. We chose r\ by 5-fold SCV. 
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We used a = 0.8 and obtained an candidate predictors having nonzero co- 
efficients. Next, we ran the original form of hard-ridge TISP with just those 
predictors and tuned the parameters by 5-fold SCV with no/AIC/BIC cor- 
rection. For comparison, we tested another two up-to-date classifiers with 
joint gene selection: the nearest shrunken centroids (Tibshirani et al. 2002) 
(denoted by NSC) and the Ebay algorithm (Efron 2009). The R package 
pamr gives an implementation of NSC and the R-code for Ebay is avail- 
able online (Efron 2009) (we revised it a little bit for prediction purposes). 
Their regularization parameters were tuned by cross-validation. To prevent 
from getting over-optimistic error rate estimates, we advocate a hierarchi- 
cal cross-validation procedure where an outer 10-fold CV is used for perfor- 
mance evaluation while the inner CVs are used for parameter tuning. Table 
2 summarizes the overall prediction and selection performances of the three 
classifiers. Hard-ridge TISP showed good performance. Hard-ridge-penalty 
+ SCV-BIC behaved the best for the given data in that it gave the small- 
est error rate and the most parsimonious model with only about 8 genes 
involved. 



Table 2: Prediction errors and number of selected genes. 
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Next we identify the relevant genes. We bootstrapped the data with 
B = 100. For each bootstrap dataset, after standardizing the predictors, we 
fit the hard-ridge penalized logistic regression with the parameters tuned by 
5-fold SCV-BIC. Figure 5 plots the frequencies of the coefficient estimates 
being nonzero and the estimate histograms over the 100 replications. The 
bootstrap results give us confidences of selecting the corresponding genes. 
The top three probesets had nonzero coefficients more frequently (> 50% of 
the time) and they jointly appeared 63 times in the selected models, the most 
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frequently visited triple in bootstrapping. We learned by annotation that all 
three probe sets - 1636_g_at, 3973CLat, and 1635_at - were associated with 
the same gene - ABL1. 




Figure 5: (a): Proportions of the coefficient estimates being nonzero over the f 00 
bootstrap replications (only the top 100 genes are plotted), (b): Histograms of 
the bootstrap coefficient estimates of the top 8 genes. 
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8 Discussion 



We built a theoretical framework for solving penalized log-likelihoods where 
the predictors can be arbitrarily grouped. In contrast to Yuan & Lin (2006), 
we do not require the predictors within the same group have been orthogo- 
nalized. Our treatment is rigorous and goes much beyond the least-squares 
model. We proved convergence conditions which are much stronger than Fried- 
man et al. (2007), Friedman et al. (20106), She (2009), and Friedman et al. 
(2010a) in theory and result in a decrease of the number of iterations in 
practice. More importantly, our framework allows for nonconvex penalties 
including the (group) Zo-penalty. (See Section 1.2 for a comparison of our 
achievements with some relevant works.) The hard-ridge penalty together 
with the SCV tuning was proposed to attain a good estimate with both ac- 
curacy and sparsity. All our studies took a viewpoint of the thresholding 
rule. 
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